Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
Large language models are increasingly trained on all the data ever produced
by humans. Many have raised concerns about the trustworthiness of public
benchmarks due to potential contamination in pre-training or fine-tuning
datasets. While most data decontamination efforts apply string matching (e.g.,
n-gram overlap) to remove benchmark data, we show that these methods are
insufficient, and simple variations of test data (e.g., paraphrasing,
translation) can easily bypass these decontamination measures. Furthermore, we
demonstrate that if such variations of test data are not eliminated, a 13B model
can easily overfit a test benchmark and achieve drastically inflated performance,
on par with GPT-4. We validate such observations in widely used benchmarks such
as MMLU, GSM8K, and HumanEval. To address this growing risk, we propose a
stronger LLM-based decontamination method and apply it to widely used
pre-training and fine-tuning datasets, revealing significant previously unknown
test overlap. For example, in pre-training sets such as RedPajama-Data-1T and
StarCoder-Data, we identified that 8-18% of the HumanEval benchmark overlaps.
Interestingly, we also find such contamination in synthetic datasets generated
by GPT-3.5/4, suggesting a potential risk of unintentional contamination. We
urge the community to adopt stronger decontamination approaches when using
public benchmarks. Moreover, we call for the community to actively develop
fresh one-time exams to evaluate models accurately. Our decontamination tool is
publicly available at https://github.com/lm-sys/llm-decontaminator
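To illustrate why string matching fails, here is a minimal sketch of the n-gram-overlap decontamination baseline the abstract critiques. All names and the n-gram length are illustrative, not taken from the paper's tool.

```python
# Illustrative n-gram-overlap decontamination check (the string-matching
# baseline the abstract argues is insufficient). Names are hypothetical.

def ngrams(text: str, n: int) -> set:
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, test_sample: str, n: int = 5) -> bool:
    """Flag a training document if it shares any n-gram with a test sample."""
    return bool(ngrams(train_doc, n) & ngrams(test_sample, n))

sample = "What is the capital of France? The capital of France is Paris."
exact_copy = sample
paraphrase = "Which city serves as France's capital? Paris is the French capital."

print(is_contaminated(exact_copy, sample))   # True: an exact copy is caught
print(is_contaminated(paraphrase, sample))   # False: the paraphrase slips through
```

The second call shows the failure mode the abstract describes: a paraphrased or translated test sample shares no long n-grams with the original, so string matching never flags it.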
On Optimal Caching and Model Multiplexing for Large Model Inference
Large Language Models (LLMs) and other large foundation models have achieved
noteworthy success, but their size exacerbates existing resource consumption
and latency challenges. In particular, the large-scale deployment of these
models is hindered by the significant resource requirements during inference.
In this paper, we study two approaches for mitigating these challenges:
employing a cache to store previous queries and learning a model multiplexer to
choose from an ensemble of models for query processing.
Theoretically, we provide an optimal algorithm for jointly optimizing both
approaches to reduce the inference cost in both offline and online tabular
settings. By combining a caching algorithm, namely Greedy Dual Size with
Frequency (GDSF) or Least Expected Cost (LEC), with a model multiplexer, we
achieve optimal rates in both offline and online settings. Empirically,
simulations show that the combination of our caching and model multiplexing
algorithms greatly improves over the baselines, with up to
improvement over the baseline when the ratio between the maximum cost and
minimum cost is . Experiments on real datasets show a
improvement in FLOPs over the baseline when the ratio for FLOPs is , and a
improvement in latency when the ratio for average latency is
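One of the caching policies the abstract pairs with a model multiplexer is Greedy Dual Size with Frequency (GDSF). The sketch below is a minimal, assumption-laden illustration of its priority rule (clock + frequency x cost / size); the class name, cost values, and fixed-capacity eviction are made up for the example, not taken from the paper.

```python
# Minimal sketch of GDSF eviction for a query cache. The priority of an entry
# is clock + frequency * cost / size, so cheap, rarely reused queries are
# evicted first. All names and values here are illustrative.

class GDSFCache:
    def __init__(self, capacity: int):
        self.capacity = capacity   # max number of cached queries
        self.clock = 0.0           # aging term, raised on each eviction
        self.entries = {}          # key -> (priority, freq, cost, size)

    def get(self, key):
        """Return the key on a hit (bumping its priority), else None."""
        if key not in self.entries:
            return None
        _, freq, cost, size = self.entries[key]
        freq += 1
        self.entries[key] = (self.clock + freq * cost / size, freq, cost, size)
        return key

    def put(self, key, cost: float, size: float = 1.0):
        """Insert a query result, evicting the lowest-priority entry if full."""
        if len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda k: self.entries[k][0])
            self.clock = self.entries[victim][0]  # age the cache to the evicted priority
            del self.entries[victim]
        self.entries[key] = (self.clock + cost / size, 1, cost, size)

cache = GDSFCache(capacity=2)
cache.put("expensive query", cost=100.0)
cache.put("cheap query", cost=1.0)
cache.put("another cheap query", cost=1.0)  # evicts "cheap query", not the expensive one
```

The design point is that priority scales with the cost of recomputing a response, which is what makes cost-aware policies like GDSF and LEC a natural fit for LLM inference, where per-query costs vary widely.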
Overestimation of thermal emittance in solenoid scans due to coupled transverse motion
The solenoid scan is a widely used method for the in-situ measurement of the
thermal emittance in a photocathode gun. The popularity of this method is due
to its simplicity and convenience since all rf photocathode guns are equipped
with an emittance compensation solenoid. This paper shows that the solenoid
scan measurement overestimates the thermal emittance in the ordinary
measurement configuration due to a weak quadrupole field (present in either the
rf gun or gun solenoid) followed by a rotation in the solenoid. This coupled
transverse dynamics aberration introduces a correlation between the beam's
horizontal and vertical motion, leading to an increase in the measured 2D
transverse emittance and thus an overestimation of the thermal emittance. This
effect was systematically studied using both analytic expressions and numerical
simulations. These studies were experimentally verified using an L-band
1.6-cell rf photocathode gun with a cesium telluride cathode, which shows a
thermal emittance overestimation of 35% with an rms laser spot size of 2.7 mm.
The paper concludes by showing that the accuracy of the solenoid scan can be
improved by using a quadrupole magnet corrector, consisting of a pair of normal
and skew quadrupole magnets. (Comment: 12 pages, 13 figures)
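The mechanism can be stated with the textbook rms projected emittance (a standard expression, not a formula taken from this paper):

```latex
% Projected rms emittance in the horizontal plane:
\varepsilon_x = \sqrt{\langle x^2\rangle \langle x'^2\rangle - \langle x x'\rangle^2}
% A weak quadrupole field followed by the solenoid's rotation couples the
% planes, so the beam moments acquire cross-plane correlations
% (e.g. \langle x y'\rangle \neq 0). The projected 2D emittance then exceeds
% the intrinsic value:
\varepsilon_{x,\mathrm{meas}} \ge \varepsilon_{\mathrm{thermal}},
% with equality only when the x--y correlations vanish.
```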
Factorized Q-Learning for Large-Scale Multi-Agent Systems
Deep Q-learning has achieved significant success in single-agent decision
making tasks. However, it is challenging to extend Q-learning to large-scale
multi-agent scenarios, due to the explosion of action space resulting from the
complex dynamics between the environment and the agents. In this paper, we
propose to make the computation of multi-agent Q-learning tractable by treating
the Q-function (w.r.t. state and joint-action) as a high-order high-dimensional
tensor and then approximating it with factorized pairwise interactions.
Furthermore, we utilize a composite deep neural network architecture for
computing the factorized Q-function, share the model parameters among all the
agents within the same group, and estimate the agents' optimal joint actions
through a coordinate descent type algorithm. All these simplifications greatly
reduce the model complexity and accelerate the learning process. Extensive
experiments on two different multi-agent problems demonstrate the performance
gain of our proposed approach in comparison with strong baselines, particularly
when there are a large number of agents. (Comment: 7 pages, 5 figures, DAI 201)
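The two ideas in the abstract, a pairwise-factorized joint Q-function and coordinate-descent action selection, can be sketched as follows. Real implementations use shared deep networks; here random lookup tables stand in for them, and all names are illustrative.

```python
import random

# Hypothetical sketch: approximate the joint Q-function as a sum of pairwise
# interaction terms, then pick joint actions by coordinate descent (improving
# one agent's action at a time). Tables replace the paper's neural networks.

random.seed(0)
n_agents, n_actions = 4, 3

# One pairwise table q[a_i][a_j] per agent pair (state held fixed for brevity).
pair_q = {(i, j): [[random.gauss(0, 1) for _ in range(n_actions)]
                   for _ in range(n_actions)]
          for i in range(n_agents) for j in range(i + 1, n_agents)}

def joint_q(actions):
    """Factorized joint Q-value: sum over all pairwise interaction terms."""
    return sum(q[actions[i]][actions[j]] for (i, j), q in pair_q.items())

def coordinate_descent(steps: int = 10):
    """Greedily re-optimize each agent's action while holding the rest fixed."""
    actions = [0] * n_agents
    for _ in range(steps):
        for i in range(n_agents):
            actions[i] = max(
                range(n_actions),
                key=lambda a: joint_q(actions[:i] + [a] + actions[i + 1:]))
    return actions

best = coordinate_descent()
```

Because each coordinate update keeps the current action as a candidate, the joint Q-value never decreases, which is what makes the per-agent updates a tractable substitute for searching the exponentially large joint-action space.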